Dan Zhou

StepAudio 2.5 Technical Report

May 22, 2026

Bin Lin, Bo Zhao, Boyong Wu, Chao Yan, Chen Wu, Cheng Yi, Chengyuan Yao, Daijiao Liu, Fei Tian, Feng Tian (+91 more)

Abstract: Unified audio-language modeling has emerged as a prominent trend in modern speech systems, promising to bring the reasoning capabilities of large language models to auditory tasks. However, existing unified foundations often struggle to match the depth of specialized systems across automatic speech recognition (ASR), text-to-speech synthesis (TTS), and realtime spoken interaction. Bridging this gap remains an open challenge. This report presents StepAudio 2.5, a unified audio-language foundation model that matches or exceeds specialized systems across all three capabilities. Rather than treating these tasks as architecturally distinct, we operate on the premise that once text and audio share a multimodal representational space, task specialization becomes a matter of operational regimes: data construction, optimization targets, and decoding constraints. Guided by this insight, we advance the post-training paradigm from standard supervised learning to task-tailored Reinforcement Learning from Human Feedback (RLHF), using it as the primary mechanism to define complex optimization targets. We leverage this RLHF-centric alignment, alongside specialized decoding, to shape a shared backbone into three distinct operational modes. Concretely, the ASR branch advances transcription efficiency via verifiable multi-token decoding; the TTS branch achieves controllable, expressive synthesis through preference-based RLHF and context-rich supervision; and the Realtime branch realizes low-latency, persona-consistent dialogue via generative reward modeling within an RLHF framework. On standard benchmarks, StepAudio 2.5 achieves state-of-the-art results across ASR, TTS, and Realtime, demonstrating that a single audio-language foundation can successfully internalize the distinct deployment objectives of speech understanding, generation, and live interaction.

AnimationBench: Are Video Models Good at Character-Centric Animation?

Apr 16, 2026

Leyi Wu, Pengjun Fang, Kai Sun, Yazhou Xing, Yinwei Wu, Songsong Wang, Ziqi Huang, Dan Zhou, Yingqing He, Ying-Cong Chen (+1 more)

Abstract: Video generation has advanced rapidly, with recent methods producing increasingly convincing animated results. However, existing benchmarks, largely designed for realistic videos, struggle to evaluate animation-style generation with its stylized appearance, exaggerated motion, and character-centric consistency. They also rely on fixed prompt sets and rigid pipelines, offering limited flexibility for open-domain content and custom evaluation needs. To address this gap, we introduce AnimationBench, the first systematic benchmark for evaluating animation image-to-video (I2V) generation. AnimationBench operationalizes the Twelve Basic Principles of Animation and IP Preservation into measurable evaluation dimensions, together with Broader Quality Dimensions including semantic consistency, motion rationality, and camera motion consistency. The benchmark supports both a standardized close-set evaluation for reproducible comparison and a flexible open-set evaluation for diagnostic analysis, and leverages visual-language models for scalable assessment. Extensive experiments show that AnimationBench aligns well with human judgment and exposes animation-specific quality differences overlooked by realism-oriented benchmarks, leading to more informative and discriminative evaluation of state-of-the-art I2V models.

* Project Page: https://animationbench.github.io
* Code: https://github.com/VideoVerses/AnimationBench

MLCTR: A Fast Scalable Coupled Tensor Completion Based on Multi-Layer Non-Linear Matrix Factorization

Sep 04, 2021

Ajim Uddin, Dan Zhou, Xinyuan Tao, Chia-Ching Chou, Dantong Yu

Abstract: Firm earnings prediction plays a vital role in investment decisions, dividend expectations, and share prices. It often involves multiple tensor-compatible datasets with non-linear multi-way relationships, spatiotemporal structures, and different levels of sparsity. Current non-linear tensor completion algorithms tend to learn noisy embeddings and overfit. This paper focuses on the embedding learning aspect of the tensor completion problem and proposes a new multi-layer neural network architecture for tensor factorization and completion (MLCTR). The network architecture offers several advantages: a series of low-rank matrix factorization (MF) building blocks to minimize overfitting, interleaved transfer functions in each layer for non-linearity, and by-pass connections to mitigate the diminishing-gradient problem and allow deeper networks. Furthermore, the model employs Stochastic Gradient Descent (SGD)-based optimization for fast convergence in training. Our algorithm is highly efficient for imputing missing values in the earnings per share (EPS) data. Experiments confirm that our strategy of incorporating non-linearity in factor matrices demonstrates impressive performance in embedding learning and end-to-end tensor models, and outperforms approaches that introduce non-linearity in the phase of reconstructing tensors from factor matrices.
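
The three architectural ingredients named in this abstract (low-rank MF blocks, an interleaved transfer function, and by-pass connections) can be illustrated with a small forward-pass sketch. This is not the authors' implementation: the `tanh` transfer function, the CP-style reconstruction in `predict`, the layer form `h + tanh(h @ u @ v)`, and every name and size below are assumptions chosen for illustration, and SGD training is omitted.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; the initialization scale is an arbitrary choice."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def mf_layer(h, u, v):
    """One hypothetical block: h + tanh(h @ u @ v).

    u (rank x inner) and v (inner x rank) with inner < rank form the
    low-rank factorization; adding h back is the by-pass connection.
    """
    z = matmul(matmul(h, u), v)
    return [
        [hi + math.tanh(zi) for hi, zi in zip(h_row, z_row)]
        for h_row, z_row in zip(h, z)
    ]


def embed(e, layers):
    """Pass a mode's base embedding through a stack of blocks."""
    h = e
    for u, v in layers:
        h = mf_layer(h, u, v)
    return h


def predict(factors, index):
    """CP-style estimate of one tensor entry from per-mode embeddings (assumed)."""
    rank = len(factors[0][0])
    return sum(
        math.prod(f[i][r] for f, i in zip(factors, index)) for r in range(rank)
    )


rng = random.Random(0)
shape, rank, inner, depth = (4, 5, 3), 6, 2, 3  # illustrative sizes only

factors = []
for n in shape:
    base = random_matrix(n, rank, rng)
    layers = [
        (random_matrix(rank, inner, rng), random_matrix(inner, rank, rng))
        for _ in range(depth)
    ]
    factors.append(embed(base, layers))

# Estimated value of the tensor entry at position (1, 2, 0).
value = predict(factors, (1, 2, 0))
```

In this sketch the non-linearity sits inside the factor matrices rather than in the reconstruction step, which mirrors the contrast the abstract draws; a trainable version would fit the base embeddings and the `u`, `v` matrices to the observed entries with SGD.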
